English - Oromo Machine Translation: An Experiment Using a Statistical Approach
Abstract
This paper deals with the translation of English documents into Oromo using statistical methods. Whereas English is the lingua franca of online information, Oromo, despite its relatively wide distribution within Ethiopia and neighbouring countries such as Kenya and Somalia, is one of the most resource-scarce languages. The paper has two main goals. The first is to test how far we can go with the limited parallel corpus available for the English-Oromo language pair, and how well existing Statistical Machine Translation (SMT) systems apply to this pair. The second is to analyze the output of the system in order to identify the challenges that need to be tackled. Because the language is resource-scarce, we could not obtain as many parallel documents as we wanted for the experiment. Nevertheless, using a limited corpus of 20,000 bilingual sentences and 62,300 monolingual sentences, a BLEU score of 17.74% was achieved.
Similar resources
Augmenting Performance of SMT Models by Deploying Fine Tokenization of the Text and Part-of-Speech Tag
This paper presents our study of exploiting the languages’ word class information augmented with some rule-based processing for phrase-based Statistical Machine Translation (SMT). In statistical machine translation, estimating word-to-word alignment probabilities for the translation model can be difficult due to the problem of sparse data: most words in a given corpus occur at most a handful of...
A Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
Improving Recall for Hindi, Telugu, Oromo to English CLIR
This paper presents the Cross Language Information Retrieval (CLIR) experiments of the Language Technologies Research Centre (LTRC, IIIT-Hyderabad) as part of our participation in the ad-hoc track of CLEF 2007. We present approaches to improve recall of query translation by handling morphological and spelling variations in source language keywords. We also present experiments using query expans...
Statistical Machine Translation of Serbian-English
In this work we present the first results of statistical approach to the machine translation of Serbian language into English and vice versa. The experiments are performed on the Assimil language course, bilingual parallel corpus which consists of about 3k sentences and 20k running words from unrestricted domain. The error rates for the translation of Serbian into English are about 35-45% and f...
Evaluation of Oromo-English Cross-Language Information Retrieval
This paper reports on the first Oromo-English CLIR system that is based on dictionary-based query translation techniques. The basic objective of the study is to design and develop an Oromo-English CLIR system with a view to enable Afaan Oromo speakers to access and retrieve the vast online information sources that are available in English by using their own (native) language queries. We describe...